Enriching a French Treebank
Abstract
This paper presents the current status of the French treebank developed at Paris 7 (Abeillé et al., 2003a). The corpus comprises 1 million words from the newspaper Le Monde, fully annotated and disambiguated for parts of speech, inflectional morphology, compounds and lemmas, and syntactic constituents. It is representative of contemporary normalized written French and covers a variety of authors and subjects (economy, literature, politics, etc.), with extracts from newspaper issues ranging from 1989 to 1993. It has been used by computational linguists to train and evaluate taggers, parsers and lemmatizers, and by psycholinguists to extract lexical and syntactic preferences (Pynte et al., 2001). It is now being enriched with functional information and used for parsing evaluation.
1. The French treebank
Like the Penn Treebank, our treebank is annotated for both parts of speech and constituents. Unlike the Penn Treebank, it is also annotated for compounds, lemmas and inflectional morphology. Our annotation choices are meant to be linguistically motivated and compatible with various linguistic theories. We have chosen surface-based annotations, with no empty categories (Abeillé and Clément, 2002; Abeillé et al., 2003b; Abeillé, 2003). With compounds amalgamated and punctuation marks not counted, the treebank comprises 870,000 tokens and 37,000 different lemmas, making up about 32,000 independent sentences. The average number of words per sentence is 27 and the average number of phrases per sentence is 20 (some phrases are unary). The corpus was automatically tagged and then hand-corrected by human annotators in a first phase, and automatically chunked and then hand-corrected in a second phase (Clément, 2001; Toussenel, 2001; Abeillé et al., 2003a).
In the first phase, the task of the annotators was to validate the sentence boundaries and the compounds (adding missing compounds and rejecting possible compounds that were irrelevant in a given context), and to validate the morpho-syntactic tags, especially for notoriously difficult cases (for example, a word that can be either a preposition or a determiner). In the second phase, the annotators’ task was to validate the constituent labels and boundaries, adding embedding where appropriate, and to flag remaining errors that had been overlooked in the first phase. They used a dedicated Emacs-based annotation tool. The annotated and validated corpus is formatted in XML, following the XCES recommendations, and is available for research purposes.
We distinguish 14 lexical categories, used for simple words as well as for compounds: A (adjective), Adv (adverb), CC (coordinating conjunction), CL (weak clitic pronoun), CS (subordinating conjunction), D (determiner), ET (foreign word), I (interjection), NC (common noun), NP (proper name), P (preposition), PRO (strong pronoun), V (verb), PONCT (punctuation mark).
We distinguish 12 phrasal categories: AP (adjectival phrase), AdP (adverbial phrase), COORD (coordinated phrase), NP (noun phrase), PP (prepositional phrase), VN (verbal nucleus), VPinf (infinitival clause), VPpart (participial clause), SENT (independent clause), Sint (parenthetical), Srel (relative clause), Ssub (other subordinated clause).
We chose to annotate only major phrases, with little internal structure (for example, determiners and modifying adjectives are at the same level in the noun phrase). For the sake of simplicity, we make sparing use of unary phrases. For rigid sequences of categories, such as dates or titles, it is difficult to determine the head, and we use one global NP with no internal constituents.
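As a minimal sketch, and not a description of the project's own tooling, the two label inventories listed above can be written down as Python dictionaries and used to check that a tree uses only known labels. The nested-tuple tree encoding and the function name `unknown_labels` are assumptions made for this illustration; the treebank itself is distributed as XML.

```python
# Label inventories copied from the text. The tree encoding used below
# (nested tuples) is hypothetical and chosen only for this illustration.
LEXICAL = {
    "A": "adjective", "Adv": "adverb", "CC": "coordinating conjunction",
    "CL": "weak clitic pronoun", "CS": "subordinating conjunction",
    "D": "determiner", "ET": "foreign word", "I": "interjection",
    "NC": "common noun", "NP": "proper name", "P": "preposition",
    "PRO": "strong pronoun", "V": "verb", "PONCT": "punctuation mark",
}
PHRASAL = {
    "AP": "adjectival phrase", "AdP": "adverbial phrase",
    "COORD": "coordinated phrase", "NP": "noun phrase",
    "PP": "prepositional phrase", "VN": "verbal nucleus",
    "VPinf": "infinitival clause", "VPpart": "participial clause",
    "SENT": "independent clause", "Sint": "parenthetical",
    "Srel": "relative clause", "Ssub": "other subordinated clause",
}


def unknown_labels(node):
    """Return the labels in a tree that are missing from the matching inventory.

    A word is (lexical tag, form); a phrase is (phrasal label, [children]).
    """
    label, content = node
    if isinstance(content, str):  # leaf: a tagged word
        return [] if label in LEXICAL else [label]
    bad = [] if label in PHRASAL else [label]
    for child in content:
        bad.extend(unknown_labels(child))
    return bad


# Flat NP: the determiner and the modifying adjective are sisters of the noun.
tree = ("SENT", [
    ("NP", [("D", "le"), ("A", "petit"), ("NC", "chat")]),
    ("VN", [("V", "dort")]),
    ("PONCT", "."),
])
problems = unknown_labels(tree)
```

Note that NP appears in both inventories (proper name as a lexical tag, noun phrase as a phrasal label), so the check has to know whether a node is a word or a phrase before looking the label up.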
For coordinations, we have a COORD phrase (for the conjunction and the non-initial conjuncts), usually included inside a major phrase (headed by the initial conjunct). We do not have discontinuous constituents, since these can usually be recovered at the functional level: in Combien voulez-vous de pommes ? (lit. how many do you want of apples?), both combien and de pommes have the same Object function. Most of the difficult cases involved PP attachment or the scope of coordination, and human annotators had to take the time needed to fully understand the sentences. We eliminated spurious ambiguities (those with the same interpretation) with an “attach high” heuristic, for example in support verb constructions such as écrire un livre sur les indiens (write a book about Indians), where the PP complement passes the linguistic tests both as a complement of the Verb and as a complement of the preceding Noun, with no semantic difference.
2. Enrichment of the treebank
2.1. Enriching the treebank with grammatical functions
Similarly to what has been done for the German Negra and Tiger treebanks (Brants et al., 2003), we have added some functional information to the French treebank. We chose to annotate surface grammatical functions only, and to mark them as labels on the phrasal categories. For clitics, we mark the corresponding functions on the verbal nucleus. Functional information such as complement (or modifier) of Noun or complement of Adjective is already implicit in the constituent hierarchy (or in the constituent label for relative clauses). We have therefore concentrated on the functional tagging of verbal dependents, for which this information was not available.
We distinguish 8 grammatical functions: A-object (A-OBJ), Subject predicate (ATS), Object predicate (ATO), De-object (DE-OBJ), Direct object (OBJ), Modifier (MOD), Prepositional object (P-OBJ), Subject (SUJ). We only annotate surface functions: the subject of a passive verb, for example, bears a Subject function, not an Object one.
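To illustrate how functions marked as labels on phrasal categories could be read back from an XML-encoded sentence, here is a minimal sketch using Python's standard library. The element and attribute names (`w`, `cat`, `fct`) and the sample sentence are assumptions made for this example, not a specification of the released format.

```python
import xml.etree.ElementTree as ET

# The eight surface functions listed in the text.
FUNCTIONS = {"A-OBJ", "ATS", "ATO", "DE-OBJ", "OBJ", "MOD", "P-OBJ", "SUJ"}

# Hypothetical fragment: the "fct" attribute and the <w> word elements are
# assumptions of this sketch, not a description of the distributed files.
SAMPLE = """
<SENT>
  <NP fct="SUJ"><w cat="NP">Paul</w></NP>
  <VN><w cat="V">lit</w></VN>
  <NP fct="OBJ"><w cat="D">un</w><w cat="NC">livre</w></NP>
  <w cat="PONCT">.</w>
</SENT>
"""


def verbal_dependents(sent):
    """List (phrase label, function, text) for function-bearing daughters."""
    found = []
    for child in sent:
        fct = child.get("fct")
        if fct is None:
            continue  # e.g. a VN without clitics, or punctuation
        if fct not in FUNCTIONS:
            raise ValueError(f"unknown function label: {fct}")
        text = " ".join(w.text for w in child.iter("w"))
        found.append((child.tag, fct, text))
    return found


sentence = ET.fromstring(SAMPLE.strip())
deps = verbal_dependents(sentence)
for label, fct, text in deps:
    print(label, fct, text)
```

Only direct daughters of the sentence are inspected here, which mirrors the restriction of functional tagging to verbal dependents; embedded phrases carry no function label in this sketch.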
Phrases have at most one function: in infinitival constructions, we only note the surface function of the NP complement (with respect to the main V) and not its “deep” subject function (with respect to the infinitival V). In Je vois Paul partir (I see Paul leaving), the NP is annotated as the direct Object of the V, not as the Subject of the Vinf. On the other hand, two constituents can have the same function in the same sentence. This is the case with inverted clitics, which are compatible with an NP subject in French. In Paul part-il ? (lit. Paul does he leave?), both the NP and the following VN are tagged with a Subject function. Discontinuous dependents are another case of independent constituents tagged with the same function (such as the Object pronouns “en” and “quelques uns” in On en a pris quelques uns (lit. We them have taken some)). For verbal nuclei (VN), we annotate the functions of the clitic pronouns included in the VN, such as Subject for a subject clitic or Direct object for an object clitic.
The grammatical functions are automatically added to the constituents (which are VN or sisters of VN) by a functional tagger developed by Jacques Steinlin and Nicolas Barrier, and then hand-corrected. The tagger is rule-based and written in Java, using the Xerces API and 115 unification-based, fully ordered rules. The rules define underspecified patterns against which the corpus trees are matched in order to assign the correct function to a given constituent, and they allow for default assignment. We evaluated the tagger against a sample of 1000 hand-corrected sentences picked randomly from the corpus. It achieves an average precision of 89.69% (best precision for subjects: 99.47%) and an average recall of 89.27% (best recall for modifiers: 95.48%) (cf. Table 1). Annotators are currently validating the functional tagging, using an enhanced version of our Emacs-based validation tool.
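The general idea of ordered, underspecified rules with a default, and of scoring the output per function, can be sketched as follows. This is an illustration in Python, not the Java tagger described above: the four rules, the feature names (`cat`, `side`, `prep`) and the toy data are all invented for the example, and plain dictionary matching stands in for unification.

```python
from collections import Counter

# Hypothetical ordered rules: (underspecified pattern, function).
# A pattern constrains only the features it mentions; the first match wins.
RULES = [
    ({"cat": "NP", "side": "left"}, "SUJ"),
    ({"cat": "NP", "side": "right"}, "OBJ"),
    ({"cat": "PP", "prep": "à"}, "A-OBJ"),
    ({"cat": "PP", "prep": "de"}, "DE-OBJ"),
]
DEFAULT = "MOD"  # default assignment when no rule applies


def matches(pattern, constituent):
    """True if every feature in the pattern has the same value in the constituent."""
    return all(constituent.get(key) == value for key, value in pattern.items())


def assign(constituent):
    """Return the function of the first matching rule, else the default."""
    for pattern, function in RULES:
        if matches(pattern, constituent):
            return function
    return DEFAULT


def precision_recall(predicted, gold):
    """Per-function (precision, recall) over aligned lists of labels."""
    correct, n_pred, n_gold = Counter(), Counter(), Counter()
    for p, g in zip(predicted, gold):
        n_pred[p] += 1
        n_gold[g] += 1
        if p == g:
            correct[p] += 1
    return {
        f: (correct[f] / n_pred[f] if n_pred[f] else 0.0,
            correct[f] / n_gold[f] if n_gold[f] else 0.0)
        for f in set(n_pred) | set(n_gold)
    }


# Toy sisters of a VN, with invented gold labels.
constituents = [
    {"cat": "NP", "side": "left"},
    {"cat": "NP", "side": "right"},
    {"cat": "PP", "side": "right", "prep": "à"},
    {"cat": "PP", "side": "right", "prep": "de"},  # gold: a modifier
    {"cat": "AdP", "side": "right"},
]
gold = ["SUJ", "OBJ", "A-OBJ", "MOD", "MOD"]
predicted = [assign(c) for c in constituents]
scores = precision_recall(predicted, gold)
```

In this toy run the de-PP is wrongly tagged DE-OBJ, which lowers recall for MOD while leaving its precision intact; distinguishing modifiers from prepositional objects is exactly the kind of decision that is left to hand-correction.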
So far, about 20% of the corpus has been validated for functional tagging. Human validation is significantly easier than in the previous annotation phases: only a subset of the constituents has to be considered, and it mostly involves understanding the sentence. The difficult choices involve distinguishing predicative complements from objects, and modifiers from prepositional objects. For the former, we use a list of verbs taking predicative complements; for the latter, we ask the annotators to apply the available linguistic tests (modifiers are more mobile than complements, only complements can be obligatory, etc.).
The distribution of the different functions among the different constituents has been computed on the same sample of 1000 sentences and is presented in Table 2. Note that certain functions are not defined for certain constituents: no NP can be an A-object, and no PP can be a subject. On the other hand, the lack of Object predicate NPs in Table 2 is only due to the small size of the sample (a valid example would be On l’a élu président ‘we have elected him president’). More surprising cases are adjectival objects, such as “peser lourd” (to weigh heavy), or locative NPs annotated as prepositional objects, such as “aller place Beauvau” (to go place Beauvau). Note that, contrary to what is usually found in spoken French, nominal subjects are the most frequent ones (clitic subjects are annotated as VN). Note also that adverbial phrases may be underestimated because we do not have unary adverbial phrases (we only annotate AdP with at least two elements).
In case of coordination, we only annotate the embedding phrase, and not the embedded COORD. We annotate COORD phrases only when they are not embedded, as is the case with “multiple conjunctions” such as Et le Maroc et l’Algérie réussiront (lit. And Morocco and Algeria will-succeed).
2.2. Using the enriched treebank
A small subset of the new treebank with functional information is being used in the French project EASY for parser evaluation (Gendner et al., 2003). EASY defines a relation-based annotation scheme inspired by Carroll et al. (2003). To convert our treebank into this richer format, we define a two-step conversion procedure: first, our constituents are split into smaller chunks; then, our functional tags (or levels of embedding) are converted into sets of dependency relations between chunks, each labelled with a grammatical function. The first step is done automatically. The second step is performed semi-automatically, with some human validation. Note that in EASY, the functions are annotated as relations between chunks, or between words and chunks.
For embedded constituents (not verbal dependents), the dependency relations can easily be read off the tree structure: a PP inside an NP, for example, bears a MOD N relation with the head Noun of the NP, a PP inside an AP bears a MOD A relation with the head Adjective of the AP, etc. The only information to be added is that of headedness, and most of the time heuristics such as taking the first N (for an NP) or the first A (for an AP) are sufficient. For verbal dependents, our functional tags are converted into binary relations. Long-distance relations (when an NP object, for example, bears a relation not with the following VN but with a more distant one, as in “Que voulez-vous dire ?”, What do you mean to say?) have to be added by hand (although some automation could be considered), as do control relations.
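The read-off of relations for embedded constituents can be sketched in a few lines. This is a hedged illustration, not the conversion tool used for EASY: the nested-tuple tree encoding, the relation spellings `MOD_N` and `MOD_A`, and the example phrases are assumptions, and chunk splitting is left out entirely.

```python
# Head heuristic from the text: first noun of an NP, first adjective of an AP.
# "NP" as a word tag is the proper-name category, so it also counts as a noun.
HEAD_TAGS = {"NP": ("NC", "NP"), "AP": ("A",)}
RELATION = {"NP": "MOD_N", "AP": "MOD_A"}  # relation spellings are assumptions


def is_word(node):
    """A word is (tag, form); a phrase is (label, [children])."""
    return isinstance(node[1], str)


def words(node):
    """Return the word forms covered by a node, in order."""
    if is_word(node):
        return [node[1]]
    return [form for child in node[1] for form in words(child)]


def head_word(phrase):
    """Return the first daughter word whose tag can head this phrase, if any."""
    label, children = phrase
    for child in children:
        if is_word(child) and child[0] in HEAD_TAGS.get(label, ()):
            return child[1]
    return None


def embedded_relations(node):
    """Collect (relation, head word, dependent text) for PPs inside an NP or AP."""
    if is_word(node):
        return []
    label, children = node
    relations = []
    for child in children:
        if is_word(child):
            continue
        if child[0] == "PP" and label in RELATION:
            head = head_word(node)
            if head is not None:
                relations.append((RELATION[label], head, " ".join(words(child))))
        relations.extend(embedded_relations(child))
    return relations


np_tree = ("NP", [
    ("D", "le"), ("NC", "président"),
    ("PP", [("P", "de"), ("NP", [("D", "la"), ("NC", "société")])]),
])
ap_tree = ("AP", [
    ("A", "fier"),
    ("PP", [("P", "de"), ("NP", [("D", "son"), ("NC", "fils")])]),
])
np_relations = embedded_relations(np_tree)
ap_relations = embedded_relations(ap_tree)
```

When the heuristic finds no head (for instance in a rigid sequence annotated as one global NP), the sketch emits no relation, leaving the case for manual validation.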
Publication date: 2004